Here is a fact every AI demo conveniently ignores: your LLM provider will go down. Not "might." Will. They rate-limit you mid-launch, deprecate the model your app depends on, raise prices overnight, and have regional outages. On the day that happens, a single-provider app is simply down — and you find out from your users.

An LLM gateway with a fallback chain is the unglamorous layer that fixes this. It routes each request down a priority list of providers until one succeeds, trips a circuit breaker on the ones that fail, and keeps your agent answering. This is how I built Agent-Routing, extracted from 18 months of running it in production.

1. Fallback Chains Per Task Class

Not every task wants the same provider. Code generation, prose, and cheap classification have different best-fit models and different cost tolerances. So the router keys a separate priority chain to each task class:

Per-task fallback chains:

    code:    openai -> gemini -> anthropic -> ollama
    content: gemini -> openai -> anthropic -> ollama
    ui:      gemini -> openai -> ollama
    simple:  openai -> gemini -> ollama

The rule is simple: try the first provider; on failure or timeout, fall through to the next. No API key for a provider? It's skipped. All cloud providers down? A local Ollama model serves the request. The agent never fully stops. My portfolio's own assistant runs a chat chain of free providers this way: NVIDIA NIM → Groq → OpenCode Zen → OpenRouter, with a paid model held back as the last-resort safety net.
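The fall-through rule above can be sketched in a few lines. This is a minimal illustration, not the Agent-Routing source: `CHAINS`, `withTimeout`, `routeWithFallback`, and the `{ enabled, call }` provider shape are all hypothetical names chosen for the example, and `enabled` stands in for "an API key is configured, or the provider is local."

```javascript
// Hypothetical chain table mirroring the per-task chains described above.
const CHAINS = {
  code: ['openai', 'gemini', 'anthropic', 'ollama'],
  content: ['gemini', 'openai', 'anthropic', 'ollama'],
  ui: ['gemini', 'openai', 'ollama'],
  simple: ['openai', 'gemini', 'ollama'],
};

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `providers` maps a provider name to { enabled, call(prompt) }.
// Both fields are assumptions made for this sketch.
async function routeWithFallback(taskClass, prompt, providers, timeoutMs = 10000) {
  const errors = [];
  for (const name of CHAINS[taskClass] ?? []) {
    const provider = providers[name];
    if (!provider || !provider.enabled) continue; // no API key: skip it
    try {
      const text = await withTimeout(provider.call(prompt), timeoutMs);
      return { provider: name, text };
    } catch (err) {
      errors.push(`${name}: ${err.message}`); // fall through to the next provider
    }
  }
  throw new Error(`all providers failed for "${taskClass}": ${errors.join('; ')}`);
}
```

A caller would pass the task class and a provider map, for example `await routeWithFallback('code', prompt, providers)`. An error is thrown only when every provider in the chain was skipped, failed, or timed out.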

Design the chain around blast radius, not just quality

Put the provider you can most afford to lose first, and your most reliable (or local) provider last. The chain is a reliability gradient, not a leaderboard.

2. The Session Circuit Breaker

Naive fallback has a hidden cost: if a provider is down, every single request still tries it first, waits for the timeout, and only then falls through. Under load that's thousands of wasted, slow calls. The fix is a circuit breaker — once a provider fails all its models, mark it OPEN for the rest of the session and skip it entirely:

Circuit breaker tripping:

    CIRCUIT OPENED: openai failed all models and is marked down for this session.

This is the difference between "degrades gracefully" and "hangs for everyone." The first failure pays the timeout cost; every request after it routes straight to a healthy provider.
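One way to get that behavior is a session-scoped set of tripped providers. The sketch below is an assumption-laden illustration: `SessionCircuitBreaker`, `routeWithBreaker`, and the `{ name, models, call }` provider shape are hypothetical, and there is deliberately no half-open retry because the breaker described here stays open for the whole session.

```javascript
// Session-scoped breaker: once a provider is marked OPEN it stays skipped
// until the process (session) ends. No half-open retry in this sketch.
class SessionCircuitBreaker {
  constructor() {
    this.open = new Set();
  }

  isOpen(provider) {
    return this.open.has(provider);
  }

  trip(provider) {
    if (this.open.has(provider)) return;
    this.open.add(provider);
    console.warn(
      `CIRCUIT OPENED: ${provider} failed all models and is marked down for this session.`
    );
  }
}

// `chain` is an ordered list of { name, models, call(model, prompt) }.
// This shape is hypothetical.
async function routeWithBreaker(chain, prompt, breaker) {
  for (const provider of chain) {
    if (breaker.isOpen(provider.name)) continue; // skip without paying a timeout
    for (const model of provider.models) {
      try {
        const text = await provider.call(model, prompt);
        return { provider: provider.name, model, text };
      } catch {
        // This model failed; try the provider's next model.
      }
    }
    breaker.trip(provider.name); // every model failed
  }
  throw new Error('all providers are down or circuit-open');
}
```

Sharing one `SessionCircuitBreaker` instance across all requests is what makes the second and later requests skip the downed provider immediately.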

3. Budget-Aware Routing

Reliability without cost control is how you wake up to a four-figure bill. The router can take a per-call budget and only route to models whose estimated cost fits:

JavaScript: budget cap per request

    // Only routes to models whose estimated cost fits within $0.001.
    // The budget is passed in a trailing options object (assumed shape).
    await router.chat(prompt, system, 'code', 0.3, 4096, { budget: 0.001 });

Combined with a free-provider-first chain, this means the expensive model only ever runs when the cheap ones have genuinely failed — the cost ceiling and the reliability floor are the same mechanism.
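The budget check itself can be a simple filter applied to the chain before any request is sent. In this sketch the model names and prices are made up for illustration (they are not real provider pricing), and the token estimate uses a rough characters-divided-by-four heuristic rather than a real tokenizer.

```javascript
// Illustrative per-1K-token prices in USD; not real provider pricing.
const MODELS = [
  { name: 'free-model', inputPer1k: 0, outputPer1k: 0 },
  { name: 'cheap-model', inputPer1k: 0.0001, outputPer1k: 0.0004 },
  { name: 'premium-model', inputPer1k: 0.003, outputPer1k: 0.015 },
];

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Worst-case cost: assumes the model uses its full output allowance.
function estimateCost(model, prompt, maxOutputTokens) {
  const inputTokens = estimateTokens(prompt);
  return (
    (inputTokens / 1000) * model.inputPer1k +
    (maxOutputTokens / 1000) * model.outputPer1k
  );
}

// Keep chain order, dropping models whose estimate exceeds the budget.
// With no budget given, every model stays eligible.
function modelsWithinBudget(models, prompt, maxOutputTokens, budget) {
  if (budget == null) return models;
  return models.filter((m) => estimateCost(m, prompt, maxOutputTokens) <= budget);
}
```

Estimating against the maximum output length is a conservative choice: it can exclude a model that would have fit in practice, but it never lets a call exceed the cap by surprise.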

4. Token Optimization (Free Savings on Every Call)

Before a prompt is sent, the router compresses it: filler words and phrases removed, whitespace collapsed. On verbose prompts that's a routine 10–20% reduction in input tokens — and since you pay per token on every provider in the chain, that saving compounds across every fallback attempt.
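A compression pass like this can be approximated with a handful of regular expressions. The filler list below is a hypothetical starting point, not the list Agent-Routing uses, and `compressPrompt` is a name invented for this sketch.

```javascript
// Hypothetical filler rules; a real list would be tuned against actual prompts.
const FILLERS = [
  [/\bin order to\b/gi, 'to'],
  [/\bplease note that\b/gi, ''],
  [/\b(?:basically|actually|kindly)\b/gi, ''],
];

function compressPrompt(prompt) {
  let out = prompt;
  for (const [pattern, replacement] of FILLERS) {
    out = out.replace(pattern, replacement);
  }
  return out
    .replace(/[ \t]+/g, ' ')        // collapse runs of spaces and tabs
    .replace(/ *\n */g, '\n')       // trim spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')     // keep at most one blank line
    .replace(/ +([,.;:!?])/g, '$1') // drop the space left before punctuation
    .trim();
}
```

A pass this blunt should skip anything where whitespace or wording is significant, such as embedded code, quoted text, or few-shot examples.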

5. Guardrails: The Gateway Is Also a Checkpoint

A single choke point for all LLM traffic is the perfect place to enforce safety. The router checks input for prompt injection before it ever reaches a model, and scans output to redact leaked secrets (API keys, JWTs, database connection strings) before returning it.
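As a rough sketch of what such a checkpoint can look like, the functions below run a pattern check on input and a redaction pass on output. Everything here is an assumption for illustration: the function names, the `sk-` key prefix, and the small pattern lists, which are far from exhaustive and would miss many real attacks and secret formats.

```javascript
// Illustrative patterns only; production detection needs a broader, tested set.
const INJECTION_PATTERNS = [
  /ignore (?:all |any )?(?:previous|prior|above) instructions/i,
  /reveal (?:your |the )?system prompt/i,
];

const SECRET_PATTERNS = [
  { label: 'API_KEY', regex: /\bsk-[A-Za-z0-9_-]{20,}/g },
  { label: 'JWT', regex: /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g },
  { label: 'DB_URL', regex: /\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?):\/\/[^\s"']+/gi },
];

// Run before the prompt reaches any model.
function checkInput(prompt) {
  const hit = INJECTION_PATTERNS.find((re) => re.test(prompt));
  return hit ? { allowed: false, reason: `matched ${hit}` } : { allowed: true };
}

// Run on the model's response before it is returned to the caller.
function redactOutput(text) {
  return SECRET_PATTERNS.reduce(
    (acc, { label, regex }) => acc.replace(regex, `[REDACTED_${label}]`),
    text
  );
}
```

Because both functions are pure and take plain strings, they can wrap the single gateway call site: reject when `checkInput(prompt).allowed` is false, otherwise return `redactOutput(response)`.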

You can't bolt this on at 40 call sites. You get it for free when every call goes through one gateway.

Per-task: a separate fallback chain for code, content, ui, and cheap tasks
OPEN: circuit breaker skips a downed provider for the rest of the session
1 choke point: budget caps, token trimming, and guardrails enforced in one place

What I Built

Agent-Routing implements all five: per-task fallback chains, session circuit breakers, budget-aware routing, token optimization, and injection/secret guardrails. The demo runs in mock mode with no API keys. The same router powers the grounded assistant on this site — see how I built the chatbot for it running live, and the concepts behind multi-agent systems for where routing fits in the bigger picture.

If you're tired of AI hype, this is the kind of boring layer that actually decides whether your app survives contact with production.